COMMUNICATION-EFFICIENT SPARSE REGRESSION: A ONE-SHOT APPROACH


By Jason D. Lee*, Qiang Liu†, Yuekai Sun*, and Jonathan E. Taylor*

Stanford University* and Dartmouth College†

We devise a one-shot approach to distributed sparse regression in 
the high-dimensional setting. The key idea is to average “debiased” or 
“desparsified” lasso estimators. We show the approach converges at 
the same rate as the lasso as long as the dataset is not split across too 
many machines. We also extend the approach to generalized linear 
models. 


1. Introduction. Explosive growth in the size of modern datasets has 
fueled interest in distributed statistical learning. For examples, we refer to 
Boyd et al. (2011); Dekel et al. (2012); Duchi, Agarwal and Wainwright 
(2012); Zhang, Duchi and Wainwright (2013) and the references therein. The 
problem arises, for example, when working with datasets that are too large 
to fit on a single machine and must be distributed across multiple machines. 
The main bottleneck in the distributed setting is usually communication 
between machines/processors, so the overarching goal of algorithm design is 
to minimize communication costs. 

In distributed statistical learning, the simplest and most popular approach
is averaging: each machine forms a local estimator $\hat\theta_k$ with the
portion of the data stored locally, and a “master” averages the local
estimators to produce an aggregate estimator:
$\bar\theta = \frac{1}{m}\sum_{k=1}^m \hat\theta_k$. Averaging was first
studied by Mcdonald et al. (2009) for multinomial regression. They derive
non-asymptotic error bounds on the estimation error that show averaging
reduces the variance of the local estimators, but has no effect on the bias
(from the centralized solution). In follow-up work, Zinkevich et al. (2010)
studied a variant of averaging where each machine computes a local estimator
with stochastic gradient descent (SGD) on a random subset of the dataset.
They show, among other things, that their estimator converges to the
centralized estimator.

More recently, Zhang, Duchi and Wainwright (2013) studied averaged
empirical risk minimization (ERM). They show that the mean squared error
(MSE) of the averaged ERM decays like $O(N^{-1/2} + \frac{m}{N})$, where $m$
is the number of machines and $N$ is the total number of samples. Thus, so
long as $m \lesssim \sqrt{N}$, the averaged ERM matches the $N^{-1/2}$
convergence rate of the


imsart-aoas ver. 2013/03/06 file: averaging_ims.tex date: August 12, 2015 



centralized ERM. Even more recently, Rosenblatt and Nadler (2014) studied
the optimality of averaged ERM in two asymptotic settings: $N \to \infty$
with $m, p$ fixed, and $p, n \to \infty$ with $\frac{p}{n} \to \mu \in (0,1)$,
where $n = \frac{N}{m}$ is the number of samples per machine. They show that
in the $n \to \infty$, $p$ fixed setting, the averaged ERM is first-order
equivalent to the centralized ERM. However, when $p, n \to \infty$, the
averaged ERM is suboptimal (versus the centralized ERM).

We develop an approach to distributed statistical learning in the
high-dimensional setting. Since $p > n$, regularization is essential. At a
high level, the key idea is to average local debiased regularized
M-estimators. We show that our averaged estimator converges at the same rate
as the centralized regularized M-estimator.

2. Background on the lasso and the debiased lasso. To keep
things simple, we focus on sparse linear regression. Consider the sparse linear
model

$y = X\beta^* + \epsilon,$

where the rows of $X \in \mathbb{R}^{n\times p}$ are predictors, and the
components of $y \in \mathbb{R}^n$ are the responses. To keep things simple,
we assume

(A1) the predictors $x \in \mathbb{R}^p$ are independent subgaussian random
vectors whose covariance $\Sigma$ has smallest eigenvalue
$\lambda_{\min}(\Sigma)$;

(A2) the regression coefficients $\beta^* \in \mathbb{R}^p$ are $s$-sparse,
i.e. all but $s$ components of $\beta^*$ are zero;

(A3) the components of $\epsilon \in \mathbb{R}^n$ are independent, mean zero
subgaussian random variables.

Given the predictors and responses, the lasso estimates $\beta^*$ by

$\hat\beta := \arg\min_{\beta\in\mathbb{R}^p} \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1.$

There is a well-developed theory of the lasso that says, under suitable
assumptions on $X$, the lasso estimator $\hat\beta$ is nearly as close to
$\beta^*$ as the oracle estimator $X_S^\dagger y$, the least squares fit
restricted to the support $S$ of $\beta^*$ (e.g. see Hastie, Tibshirani and
Wainwright (2015), Chapter 11 for an overview). More precisely, under some
conditions on $\frac{1}{n}X^TX$, the MSE of the lasso estimator is roughly
$\frac{s\log p}{n}$. Since the MSE of the oracle estimator is (roughly)
$\frac{s}{n}$, the lasso estimator is almost as good as the oracle
estimator.
However, the lasso estimator is also biased.¹ Since averaging only reduces
variance, not bias, we gain (almost) nothing by averaging the biased lasso
estimators. That is, it is possible to show that if we naively averaged local
lasso estimators, the MSE of the averaged estimator is of the same order as
that of the local estimators. The key to overcoming the bias of the averaged
lasso estimator is to “debias” the lasso estimators before averaging.
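The point that averaging cannot remove bias can be seen in a one-dimensional toy problem. The sketch below is not the paper's experiment: it assumes a scalar parameter, Gaussian noise, and soft-thresholding of the local sample mean as a stand-in for the lasso (which is what the lasso reduces to under an orthonormal design). All sizes and names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t (the lasso solution under an orthonormal design)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(0)
m, n = 50, 100                 # hypothetical: machines, samples per machine
theta_star, lam = 1.0, 0.3     # hypothetical: true coefficient, shrinkage level

# Each machine averages its n noisy observations, then shrinks the result.
local_means = theta_star + rng.standard_normal((m, n)).mean(axis=1)
local_shrunk = soft_threshold(local_means, lam)

# Averaging the shrunken estimators leaves an error close to lam (the bias),
# while averaging the unshrunken means drives the error toward zero.
err_avg_shrunk = abs(local_shrunk.mean() - theta_star)
err_avg_plain = abs(local_means.mean() - theta_star)
print(f"averaged shrunken estimators: {err_avg_shrunk:.3f}")
print(f"averaged unbiased estimators: {err_avg_plain:.3f}")
```

Adding machines shrinks the second error but leaves the first one pinned near the shrinkage level, which is the behaviour the debiasing step is meant to avoid.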

The debiased lasso estimator by Javanmard and Montanari (2013a) is

(2.1) $\hat\beta^d := \hat\beta + \frac{1}{n}\hat\Theta X^T(y - X\hat\beta),$

where $\hat\beta$ is the lasso estimator and
$\hat\Theta \in \mathbb{R}^{p\times p}$ is an approximate inverse to
$\hat\Sigma = \frac{1}{n}X^TX$. Intuitively, the debiased lasso estimator
trades bias for variance. The trade-off is obvious when $\hat\Sigma$ is
non-singular: setting $\hat\Theta = \hat\Sigma^{-1}$ gives the ordinary least
squares (OLS) estimator $(X^TX)^{-1}X^Ty$.
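The OLS special case is easy to check numerically. The following sketch assumes a low-dimensional design ($n > p$) so that $\hat\Sigma$ is invertible; the data and the starting estimate are made up for illustration, and the starting estimate need not be a lasso solution for the identity to hold.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5                                   # hypothetical sizes with n > p
X = rng.standard_normal((n, p))
beta_star = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ beta_star + 0.1 * rng.standard_normal(n)

Sigma_hat = X.T @ X / n
Theta_hat = np.linalg.inv(Sigma_hat)            # exact inverse, possible since n > p

def debias(beta_init, Theta):
    """Equation (2.1): add Theta times the scaled residual correlation."""
    return beta_init + Theta @ X.T @ (y - X @ beta_init) / n

beta_any = rng.standard_normal(p)               # arbitrary initial estimate
beta_d = debias(beta_any, Theta_hat)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.max(np.abs(beta_d - beta_ols)))        # numerically zero
```

With the exact inverse, the correction cancels the initial estimate entirely, so all of the bias is removed and the variance is that of OLS.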

Another way to interpret the debiased lasso estimator is as a corrected
estimator that compensates for the bias incurred by shrinkage. By the
optimality conditions of the lasso, the correction term
$\frac{1}{n}X^T(y - X\hat\beta)$ is a subgradient of $\lambda\|\cdot\|_1$ at
$\hat\beta$. By adding a term proportional to the subgradient of the
regularizer, the debiased lasso estimator compensates for the bias incurred
by regularization. The debiased lasso estimator has previously been used to
perform inference on the regression coefficients in high-dimensional
regression models. We refer to the papers by Javanmard and Montanari (2013a);
van de Geer et al. (2013); Zhang and Zhang (2014); Belloni, Chernozhukov
and Hansen (2011) for details.

The choice of $\hat\Theta$ in the correction term is crucial to the
performance of the debiased estimator. Javanmard and Montanari (2013a)
suggest forming $\hat\Theta$ row by row: the $j$-th row of $\hat\Theta$ is
the optimum of

(2.2) $\begin{aligned} &\text{minimize}_{\theta\in\mathbb{R}^p} && \theta^T\hat\Sigma\theta \\ &\text{subject to} && \|\hat\Sigma\theta - e_j\|_\infty \le \delta. \end{aligned}$

The parameter $\delta$ should be large enough to keep the problem feasible,
but as small as possible to keep the bias (of the debiased lasso estimator)
small. As we shall see, when the rows of $X$ are subgaussian, setting
$\delta \sim (\frac{\log p}{n})^{1/2}$ is usually large enough to keep (2.2)
feasible.


¹ We refer to Section 2.2 in Javanmard and Montanari (2013a) for a more
formal discussion of the bias of the lasso estimator.
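Definition 2.1 (Generalized coherence). Given $X \in \mathbb{R}^{n\times p}$,
let $\hat\Sigma = \frac{1}{n}X^TX$. The generalized coherence between
$\hat\Sigma$ and $\hat\Theta \in \mathbb{R}^{p\times p}$ is

$GC(\hat\Sigma, \hat\Theta) = \max_{j\in[p]} \|\hat\Sigma\hat\Theta_j^T - e_j\|_\infty,$

where $\hat\Theta_j$ denotes the $j$-th row of $\hat\Theta$.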
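As a concrete reading of Definition 2.1, the generalized coherence is the largest entry (in absolute value) of $\hat\Sigma\hat\Theta^T - I$. The sketch below computes it for two candidate matrices on simulated data; the function name and problem sizes are illustrative assumptions.

```python
import numpy as np

def generalized_coherence(Sigma_hat, Theta_hat):
    """max_j || Sigma_hat @ Theta_hat[j] - e_j ||_inf over the rows of Theta_hat."""
    p = Sigma_hat.shape[0]
    # Column j of Sigma_hat @ Theta_hat.T is Sigma_hat applied to row j of Theta_hat.
    return np.max(np.abs(Sigma_hat @ Theta_hat.T - np.eye(p)))

rng = np.random.default_rng(2)
n, p = 500, 10                                  # hypothetical sizes with n > p
X = rng.standard_normal((n, p))
Sigma_hat = X.T @ X / n

gc_exact = generalized_coherence(Sigma_hat, np.linalg.inv(Sigma_hat))
gc_identity = generalized_coherence(Sigma_hat, np.eye(p))
print(gc_exact, gc_identity)
```

The exact inverse has coherence zero (up to rounding), while the population precision matrix (here the identity) has a small but nonzero coherence driven by sampling error in $\hat\Sigma$.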

Lemma 2.2 (Javanmard and Montanari (2013a)). Under (Al), when 
16na).n > log p, the event 

£ gc (±) := jGC(E, E _1 ) < 

occurs with probability at least $1 - 2p^{-2}$ for some $c_1 > 0$, where
$\kappa := \frac{\lambda_{\max}(\Sigma)}{\lambda_{\min}(\Sigma)}$ is the
condition number of $\Sigma$.


As we shall see, the bias of the debiased lasso estimate is of higher order
than its variance under suitable conditions on $\hat\Sigma$. In particular,
we require $\hat\Sigma$ to satisfy the restricted eigenvalue (RE) condition.

Definition 2.3 (RE condition). For any $S \subset [p]$, let

$C(S) := \{\Delta \in \mathbb{R}^p \mid \|\Delta_{S^c}\|_1 \le 3\|\Delta_S\|_1\}.$

We say $\hat\Sigma$ satisfies the RE condition on the cone $C(S)$ when

$\Delta^T\hat\Sigma\Delta \ge \mu_l\|\Delta_S\|_2^2$

for some $\mu_l > 0$ and any $\Delta \in C(S)$.

The RE condition requires $\hat\Sigma$ to be positive definite on $C(S)$.
When the rows of $X \in \mathbb{R}^{n\times p}$ are i.i.d. Gaussian random
vectors, Raskutti, Wainwright and Yu (2010) show there are constants
$\mu_1, \mu_2 > 0$ such that

$\frac{1}{\sqrt{n}}\|X\Delta\|_2 \ge \mu_1\|\Delta\|_2 - \mu_2\sqrt{\frac{\log p}{n}}\|\Delta\|_1$ for any $\Delta \in \mathbb{R}^p$

with probability at least $1 - c_1\exp(-c_2 n)$. Their result implies the RE
condition holds on $C(S)$ (for any $S \subset [p]$) as long as
$n \gtrsim |S|\log p$, even when there are dependencies among the predictors.
Their result was extended to subgaussian designs by Rudelson and Zhou (2013),
also allowing for dependencies among the covariates. We summarize their
result in a lemma.

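Lemma 2.4. Under (A1), when
$n > 4000\tilde{s}\sigma_x^4\log(\frac{60\sqrt{2}ep}{\tilde{s}})$ and
$p > \tilde{s}$, where $\tilde{s} := s + 25920\kappa s$, the event

$\mathcal{E}_{RE}(X) = \{\Delta^T\hat\Sigma\Delta \ge \frac{1}{2}\lambda_{\min}(\Sigma)\|\Delta_S\|_2^2 \text{ for any } \Delta \in C(S)\}$

occurs with probability at least $1 - 2e^{-\frac{n}{4000\sigma_x^4}}$.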
Proof. The lemma is a consequence of Rudelson and Zhou (2013), Theorem 6.
In their notation, we set $\delta = \frac{1}{\sqrt{2}}$, $k_0 = 3$ and bound
$\max_{j\in[p]}\|Ae_j\|_2^2$ and $K(s_0, k_0, \Sigma^{1/2})$ by
$\lambda_{\max}(\Sigma)$ and $\lambda_{\min}(\Sigma)^{-1/2}$. □

When the RE condition holds, the lasso and debiased lasso estimators
are consistent for a suitable choice of the regularization parameter
$\lambda$. The parameter $\lambda$ should be large enough to dominate the
“empirical process” part of the problem:
$\lambda \gtrsim \frac{1}{n}\|X^T\epsilon\|_\infty$, but as small as possible
to reduce the bias incurred by regularization. As we shall see, setting
$\lambda \sim \sigma_y(\frac{\log p}{n})^{1/2}$ is a good choice.


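Lemma 2.5. Under (A3),

$\frac{1}{n}\|X^T\epsilon\|_\infty \le \max_{j\in[p]}(\hat\Sigma_{j,j})^{1/2}\,\sigma_y\left(\frac{3\log p}{c_2 n}\right)^{1/2}$

with probability at least $1 - ep^{-2}$ for any (non-random)
$X \in \mathbb{R}^{n\times p}$.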


When $\hat\Sigma$ satisfies the RE condition and $\lambda$ is large enough,
the lasso and debiased lasso estimators are consistent.


Lemma 2.6 (Negahban et al. (2012)). Under (A2) and (A3), suppose
$\hat\Sigma$ satisfies the RE condition on $C^*$ with constant $\mu_l$ and
$\frac{1}{n}\|X^T\epsilon\|_\infty \le \lambda$. Then

$\|\hat\beta - \beta^*\|_1 \le \frac{3}{\mu_l}s\lambda$ and $\|\hat\beta - \beta^*\|_2 \le \frac{3}{\mu_l}\sqrt{s}\lambda.$

When the lasso estimator is consistent, the debiased lasso estimator is
also consistent. Further, it is possible to show that the bias of the debiased
estimator is of higher order than its variance. Similar results by Javanmard
and Montanari (2013a); van de Geer et al. (2013); Zhang and Zhang (2014);
Belloni, Chernozhukov and Hansen (2011) are the key step in showing the
asymptotic normality of (the components of) the debiased lasso estimator.
The result we state is essentially Javanmard and Montanari (2013a),
Theorem 2.3.

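Lemma 2.7. Under the conditions of Lemma 2.6, when
$(\hat\Sigma, \hat\Theta)$ has generalized coherence $\delta$, the debiased
lasso estimator has the form

$\hat\beta^d = \beta^* + \frac{1}{n}\hat\Theta X^T\epsilon + \hat\Delta,$

where $\|\hat\Delta\|_\infty \le \frac{3\delta}{\mu_l}s\lambda$.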
Lemma 2.7, together with Lemmas 2.5 and 2.2, shows that the bias of the
debiased lasso estimator is of higher order than its variance. In particular,
setting $\lambda$ and $\delta$ according to Lemmas 2.5 and 2.2 gives a bias
term $\|\hat\Delta\|_\infty$ that is $O(\frac{s\log p}{n})$. By comparison,
the variance term $\frac{1}{n}\|\hat\Theta X^T\epsilon\|_\infty$ is the
maximum of $p$ subgaussian random variables with mean zero and variances of
$O(\frac{1}{n})$, which is $O((\frac{\log p}{n})^{1/2})$. Thus the bias term
is of higher order than the variance term as long as $n \gtrsim s^2\log p$.
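The comparison of the two orders can be made concrete with a few lines of arithmetic. The sketch below ignores all constants and uses made-up values of $s$ and $p$; the helper names are hypothetical.

```python
import math

def bias_order(s, p, n):
    """Order of the bias term, s * log(p) / n (constants dropped)."""
    return s * math.log(p) / n

def variance_order(p, n):
    """Order of the variance term, sqrt(log(p) / n) (constants dropped)."""
    return math.sqrt(math.log(p) / n)

s, p = 10, 10_000                      # hypothetical sparsity and dimension
n_threshold = s**2 * math.log(p)       # the bias and variance orders cross here
for n in (500, 10_000):
    ratio = bias_order(s, p, n) / variance_order(p, n)
    side = "bias dominates" if ratio > 1 else "variance dominates"
    print(f"n = {n:6d}: bias/variance = {ratio:.3f} ({side})")
```

The ratio equals $s(\frac{\log p}{n})^{1/2}$, so it drops below one exactly when $n$ exceeds $s^2\log p$ (about 921 for these values).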


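Corollary 2.8. Under (A2), (A3), and the conditions of Lemma 2.6,
when $(\hat\Sigma, \hat\Theta)$ has generalized coherence $\delta$ and we set
$\lambda = \max_{j\in[p]}(\hat\Sigma_{j,j})^{1/2}\sigma_y(\frac{3\log p}{c_2 n})^{1/2}$,

$\|\hat\Delta\|_\infty \le \frac{3\sqrt{3}}{\sqrt{c_2}}\,\frac{\delta}{\mu_l}\,\max_{j\in[p]}(\hat\Sigma_{j,j})^{1/2}\,\sigma_y\, s\left(\frac{\log p}{n}\right)^{1/2}.$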


3. Averaging debiased lassos. Recall the problem setup: we are given
$N$ samples of the form $(x_i, y_i)$ distributed across $m$ machines:

$X = \begin{bmatrix} X_1 \\ \vdots \\ X_m \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}.$

The $k$-th machine has local predictors $X_k \in \mathbb{R}^{n_k\times p}$ and
responses $y_k \in \mathbb{R}^{n_k}$. To keep things simple, we assume the
data is evenly distributed, i.e. $n_1 = \cdots = n_m = n = \frac{N}{m}$. The
averaged debiased lasso estimator (for lack of a better name) is

(3.1) $\bar\beta = \frac{1}{m}\sum_{k=1}^m \hat\beta_k^d = \frac{1}{m}\sum_{k=1}^m \left(\hat\beta_k + \frac{1}{n}\hat\Theta_k X_k^T(y_k - X_k\hat\beta_k)\right).$

We study the error of the averaged debiased lasso in the $\ell_\infty$ norm.
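A small simulation sketch of (3.1) follows. It is not the paper's experiment: to stay short it assumes $n > p$ on every machine, so that $\hat\Theta_k = \hat\Sigma_k^{-1}$ is the solution of (2.2) with $\delta = 0$, and it uses a plain coordinate-descent lasso solver written for this example. Problem sizes and the value of $\lambda$ are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/2n)||y - X b||_2^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()                      # invariant: resid = y - X @ beta
    col_norm = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]    # drop coordinate j from the fit
            beta[j] = soft_threshold(X[:, j] @ resid / n, lam) / col_norm[j]
            resid -= X[:, j] * beta[j]    # put the updated coordinate back
    return beta

rng = np.random.default_rng(3)
m, n, p, lam = 10, 200, 20, 0.5           # hypothetical problem sizes
beta_star = np.zeros(p)
beta_star[:3] = 2.0

local_lasso, local_debiased = [], []
for _ in range(m):
    X = rng.standard_normal((n, p))
    y = X @ beta_star + rng.standard_normal(n)
    b = lasso_cd(X, y, lam)
    Theta = np.linalg.inv(X.T @ X / n)    # assumption: n > p, i.e. (2.2) with delta = 0
    local_lasso.append(b)
    local_debiased.append(b + Theta @ X.T @ (y - X @ b) / n)

err_naive = np.abs(np.mean(local_lasso, axis=0) - beta_star).max()
err_debiased = np.abs(np.mean(local_debiased, axis=0) - beta_star).max()
print(f"naive averaged lasso:    {err_naive:.3f}")
print(f"averaged debiased lasso: {err_debiased:.3f}")
```

In the high-dimensional regime the inverse does not exist and $\hat\Theta_k$ would have to come from (2.2) or from nodewise regression instead; the averaging step itself is unchanged.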


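Lemma 3.1. Suppose the local sparse regression problem on each machine
satisfies the conditions of Corollary 2.8, that is, when $m < p$,

1. $\{\hat\Sigma_k\}_{k\in[m]}$ satisfy the RE condition on $C^*$ with constant $\mu_l$,

2. $\{(\hat\Sigma_k, \hat\Theta_k)\}_{k\in[m]}$ have generalized coherence $c_{GC}(\frac{\log p}{n})^{1/2}$,

3. we set $\lambda_1 = \cdots = \lambda_m = c_\Sigma\sigma_y(\frac{3\log p}{c_2 n})^{1/2}$.

Then

$\|\bar\beta - \beta^*\|_\infty \le \sigma_y\left(\frac{2c_\Omega\log p}{c_2 N}\right)^{1/2} + \frac{3\sqrt{3}\,c_{GC}\,c_\Sigma}{\sqrt{c_2}\,\mu_l}\,\sigma_y\,\frac{s\log p}{n}$

with probability at least $1 - ep^{-1}$, where $c_2 > 0$ is a universal
constant, $c_\Omega := \max_{j\in[p],k\in[m]}(\hat\Theta_k\hat\Sigma_k\hat\Theta_k^T)_{j,j}$
and $c_\Sigma := \max_{j\in[p],k\in[m]}((\hat\Sigma_k)_{j,j})^{1/2}$.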
Lemma 3.1 hints at the performance of the averaged debiased lasso. In
particular, we note the first term is $O((\frac{\log p}{N})^{1/2})$, which
matches the convergence rate of the centralized estimator. When $n$ is large
enough, $\frac{s\log p}{n}$ is negligible compared to
$(\frac{\log p}{N})^{1/2}$, and the error is $O((\frac{\log p}{N})^{1/2})$.

Finally, we show the conditions of Lemma 3.1 occur with high probability
when the rows of $X$ are independent subgaussian random vectors.


Theorem 3.2. Under (Al), (A2), and (AS), when m < p, p > s, 

1. n> max{4000.?<7^: log( 60x ^ ep ), 8000cr^logp, max{cr^, o-^jlogp}, 

2. we set X 1 = ■ ■ ■ = X m = max je[p]M[m] ((t k ) . .)ha y 5 , 

3. we set ^ = ■ • • = 8 m = 2 and form {® k } k ^ m] by ( 2 . 2 ) 

,15 A* 11 / ( l °SP\ * , ^ma x je{p] ('E j> j)h 2 slogp^ 

W-P - - - j +-X—(£)- 

with probability at least 1 — (8 + e)p~ 1 for some universal constant c > 0. 


Proof. We start with the conclusion of Lemma 3.1:

$\|\bar\beta - \beta^*\|_\infty \le \sigma_y\left(\frac{2c_\Omega\log p}{c_2 N}\right)^{1/2} + \frac{3\sqrt{3}\,c_{GC}\,c_\Sigma}{\sqrt{c_2}\,\mu_l}\,\sigma_y\,\frac{s\log p}{n}.$

First, we show that the two constants
$c_\Omega = \max_{j\in[p],k\in[m]}(\hat\Theta_k\hat\Sigma_k\hat\Theta_k^T)_{j,j}$ and
$c_\Sigma := \max_{j\in[p],k\in[m]}((\hat\Sigma_k)_{j,j})^{1/2}$ are bounded
with high probability.

Lemma 3.3. Under (Al), 

Pr(maxj g [ p ] E^EEt 1 > 2max Jg [p] E“J-) < 2 pe Cimm ^°f 
for some universal constant c\ > 0 . 


Since we form {0 fc } fce [ m ] by (2.2), 

(0 fc S fc 0fc)i,i < nr ax j g [p ] (S~ 1 SfcS _1 ))j J -. 

Lemma 3.3 implies 

maXj g [ p ](S _1 SfcS _1 ))jj < 2 maxj g [p] T,jj for each k € [m] 

—ci min{} 

with probability at least 1 — 2pe <Tx . 
—ci min{ —pTj-, } 

1 16o-2 ’4ct x j 


Lemma 3.4. Under (Al), 

Pr(max je [p](S JJ )^ > v^max je [ p ](£.,j)^) < 2 pe 

for some universal constant c\ > 0. 

We put the pieces together to obtain the stated result: 

1. By Lemma 3.3 (and a union bound over k E [m]), 

_ , _ —ci min{} 

Pi*(cq > 2maxj E - •) < 2mpe ^ . 

Since m < p, when n > ^ maxjcr^, <7^} logp, 


-l 


Pr(c n < 2 max S^-j) > I - 2p 

2. By Lemma 3.4 (and a union bound over k E [m]), 

Pr(cs < v / 2maXj e [ p ](Sj J -)^) > 1 — 2mpe C1 mm ^ 

When n > ^ max{<7^, <7 X } logp, the right side is again at most 2 p 

3. By Lemma 2.4, as long as

$n > \max\{4000\tilde{s}\sigma_x^4\log(\frac{60\sqrt{2}ep}{\tilde{s}}),\ 8000\sigma_x^4\log p\},$

$\hat\Sigma_1, \dots, \hat\Sigma_m$ all satisfy the RE condition with
probability at least

$1 - 2me^{-\frac{n}{4000\sigma_x^4}} \ge 1 - 2p^{-1}.$

4. By Lemma 2.2,

$\Pr(\cap_{k\in[m]}\mathcal{E}_{GC}(\hat\Sigma_k)) \ge 1 - 2mp^{-2}.$

Since $m < p$, the probability is at least $1 - 2p^{-1}$.

We apply the bounds c G < 2max Jg[p ] T,jj, c s < \/ 2 max je ^('E j j) 2 , c G c = 
and (Ji = ^A m j n (S) to obtain 


48 v / 6 V / ^max jeb] (S JJ )2 0 slogp 
A min (S) ax(7y n 


□ 
Fig 1: The estimation error (in $\ell_\infty$ norm) of the averaged debiased
lasso estimator versus that of the centralized lasso when the predictors are
Gaussian. In both settings, the estimation error of the averaged debiased
estimator is comparable to that of the centralized lasso, while that of the
naive averaged lasso is much worse.

We validate our theoretical results with simulations. First, we study the
estimation error of the averaged debiased lasso in $\ell_\infty$ norm. To
focus on the effect of averaging, we grow the number of machines $m$ linearly
with the (total) sample size $N$. In other words, we fix the sample size per
machine $n$ and grow the total sample size $N$ by adding machines. Figure 1
compares the estimation error (in $\ell_\infty$ norm) of the averaged debiased
lasso estimator with that of the centralized lasso. We see the estimation
error of the averaged debiased lasso estimator is comparable to that of the
centralized lasso, while that of the naive averaged lasso is much worse.

We conduct a second set of simulations to study the effect of the number
of machines on the estimation error of the averaged estimator. To focus on
the effect of the number of machines $k$, we fix the (total) sample size $N$
and vary the number of machines the samples are distributed across. Figure
2 shows how the estimation error (in $\ell_\infty$ norm) of the averaged
estimator grows as the number of machines grows. When the number of machines
is small, the estimation error of the averaged estimator is comparable to that
of the centralized lasso. However, when the number of machines exceeds a
certain threshold, the estimation error grows with the number of machines.
This is consistent with the prediction of Theorem 3.2: when the number of
machines exceeds a certain threshold, the bias term of order
$\frac{s\log p}{n}$ becomes dominant.

[Figure 2 panels: ($\Sigma = I$, $p = 10^4$, $nk = 2\times 10^5$) and ($\Sigma_{ij} = 0.5^{|i-j|}$, $p = 10^4$, $nk = 2\times 10^5$)]

Fig 2: The estimation error (in $\ell_\infty$ norm) of the averaged estimator
as the number of machines $k$ varies. When the number of machines is small,
the error is comparable to that of the centralized lasso. However, when the
number of machines exceeds a certain threshold, the bias term (which grows
linearly in $k$) is dominant, and the performance of the averaged estimator
degrades.

The averaged debiased lasso has one serious drawback versus the lasso:
$\bar\beta$ is usually dense. The density of $\bar\beta$ detracts from the
interpretability of the coefficients and makes the estimation error large in
the $\ell_2$ and $\ell_1$ norms. To remedy both problems, we threshold the
averaged debiased lasso:

$HT_t(\bar\beta)_j \leftarrow \bar\beta_j \cdot 1_{\{|\bar\beta_j| > t\}},$

$ST_t(\bar\beta)_j \leftarrow \mathrm{sign}(\bar\beta_j)\cdot\max\{|\bar\beta_j| - t, 0\}.$

As we shall see, both hard and soft-thresholding give sparse aggregates
that are close to $\beta^*$ in $\ell_2$ norm.
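The two thresholding rules are one-liners in array code. This is a minimal sketch with a made-up input vector; the function names are chosen for this example.

```python
import numpy as np

def hard_threshold(beta, t):
    """Keep entries with |beta_j| > t unchanged and zero out the rest."""
    return np.where(np.abs(beta) > t, beta, 0.0)

def soft_threshold(beta, t):
    """Shrink every entry toward zero by t; entries within t of zero become zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - t, 0.0)

beta_bar = np.array([3.0, -0.2, 0.4, -1.5])   # hypothetical dense aggregate
ht = hard_threshold(beta_bar, 0.5)            # large entries survive untouched
st = soft_threshold(beta_bar, 0.5)            # large entries survive, shrunk by 0.5
print(ht)
print(st)
```

Hard-thresholding leaves the surviving coefficients unbiased, while soft-thresholding reintroduces a shrinkage of size $t$ on them; both produce the same sparsity pattern.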

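Lemma 3.5. As long as $t > \|\bar\beta - \beta^*\|_\infty$,
$\bar\beta^{ht} := HT_t(\bar\beta)$ satisfies

1. $\|\bar\beta^{ht} - \beta^*\|_\infty \le 2t$,

2. $\|\bar\beta^{ht} - \beta^*\|_2 \le 2\sqrt{2s}\,t$,

3. $\|\bar\beta^{ht} - \beta^*\|_1 \le 2\sqrt{2}\,st$.

The analogous result also holds for $\bar\beta^{st} := ST_t(\bar\beta)$.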
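Proof. By the triangle inequality,

$\|\bar\beta^{ht} - \beta^*\|_\infty \le \|\bar\beta^{ht} - \bar\beta\|_\infty + \|\bar\beta - \beta^*\|_\infty \le t + \|\bar\beta - \beta^*\|_\infty \le 2t.$

Since $t > \|\bar\beta - \beta^*\|_\infty$, $\bar\beta^{ht}_j = 0$ whenever
$\beta^*_j = 0$. Thus $\bar\beta^{ht}$ is $s$-sparse and
$\bar\beta^{ht} - \beta^*$ is $2s$-sparse. By the equivalence between the
$\ell_\infty$ and $\ell_2$, $\ell_1$ norms,

$\|\bar\beta^{ht} - \beta^*\|_2 \le 2\sqrt{2s}\,t,$

$\|\bar\beta^{ht} - \beta^*\|_1 \le 2\sqrt{2}\,st.$

The argument for $\bar\beta^{st}$ is similar. □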
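The three bounds of Lemma 3.5 can be checked on a synthetic instance. The sketch below builds an $s$-sparse target and a dense perturbation whose sup-norm error is strictly below $t$; every number in it is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
p, s, t = 200, 5, 0.1                              # hypothetical sizes and threshold
beta_star = np.zeros(p)
beta_star[:s] = rng.uniform(-1.0, 1.0, size=s)

# A dense estimate whose sup-norm error is strictly below t.
beta_bar = beta_star + rng.uniform(-0.9 * t, 0.9 * t, size=p)
beta_ht = np.where(np.abs(beta_bar) > t, beta_bar, 0.0)

diff = beta_ht - beta_star
err_inf = np.abs(diff).max()
err_2 = np.linalg.norm(diff)
err_1 = np.abs(diff).sum()
dense_err_1 = np.abs(beta_bar - beta_star).sum()   # l1 error before thresholding

print(f"l_inf: {err_inf:.3f} <= {2 * t:.3f}")
print(f"l_2:   {err_2:.3f} <= {2 * np.sqrt(2 * s) * t:.3f}")
print(f"l_1:   {err_1:.3f} <= {2 * np.sqrt(2) * s * t:.3f} (dense: {dense_err_1:.3f})")
```

The dense estimate accumulates error over all $p$ coordinates in the $\ell_1$ norm, whereas the thresholded one only pays on the support.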

By combining Lemma 3.5 with Theorem 3.2, we show that $\bar\beta^{ht}$
converges at the same rates as the centralized lasso.


Theorem 3.6. Under the conditions of Theorem 3.2, hard-thresholding 


(3 at (T y 


4maxjg[ pl SM logp^ 2 48^ V^ max jeM ( S J,j) 2 ^2 „ slogp 


c 2 N 

ht «*|| < 




+ 


Amin(S) 


(J x (J y~ 


gives 


00 


1. \\0 nt — f3 

2 . wr-Pu \2 ^p<t, 

3- W ht -I3*\\l<P°y 


~P a y 

ht _ A*||» iiy 


max je[p] T/,j I°SP\ 2 \/^ max i6[p]slogp 

) ^ A min (E) 


N 

j E. 

N 


a x a y n ) 


max ie[p] 2 + yTmax Jg[p] (Sj,j )2 


2 n s 2 Iogp 
U X U V n J 


max j g [p] j s 2 log P ^ 2 ! v^maxj6[p](%,i) J 2 s 2 logp 

N ) A min (£) n ' 


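Remark 3.7. By Theorem 3.6, when $m \lesssim \frac{n}{s^2\log p}$, the
variance term is dominant and the convergence rates given by the theorem
simplify:

1. $\|\bar\beta^{ht} - \beta^*\|_\infty \lesssim_P (\frac{\log p}{N})^{1/2}$,

2. $\|\bar\beta^{ht} - \beta^*\|_2 \lesssim_P (\frac{s\log p}{N})^{1/2}$,

3. $\|\bar\beta^{ht} - \beta^*\|_1 \lesssim_P (\frac{s^2\log p}{N})^{1/2}$.

The convergence rates for the centralized lasso estimator $\hat\beta$ are
identical (modulo constants):

1. $\|\hat\beta - \beta^*\|_\infty \lesssim_P (\frac{\log p}{N})^{1/2}$,

2. $\|\hat\beta - \beta^*\|_2 \lesssim_P (\frac{s\log p}{N})^{1/2}$,

3. $\|\hat\beta - \beta^*\|_1 \lesssim_P (\frac{s^2\log p}{N})^{1/2}$.

The estimator $\bar\beta^{ht}$ matches the convergence rates of the
centralized lasso in $\ell_1$, $\ell_2$, and $\ell_\infty$ norms. Furthermore,
$\bar\beta^{ht}$ can be evaluated in a communication-efficient manner by a
one-shot averaging approach.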


We conduct a third set of simulations to study the effect of thresholding
on the estimation error in $\ell_2$ norm. Figure 3 compares the estimation
error incurred by the averaged estimator with and without thresholding versus
that of the centralized lasso. Since the averaged estimator is usually dense,
its estimation error (in $\ell_2$ norm) is large compared to that of the
centralized lasso. However, after thresholding, the averaged estimator
performs comparably to the centralized lasso.

[Figure 3 panels: ($\Sigma = I$, $p = 10^4$, $n = 5\times 10^3$) and ($\Sigma_{ij} = 0.5^{|i-j|}$, $p = 10^4$, $n = 5\times 10^3$)]

Fig 3: The estimation error (in $\ell_2$ norm) of the averaged estimator with
and without thresholding versus that of the centralized lasso when the
predictors are Gaussian. In both settings, thresholding reduces the
estimation error by order(s) of magnitude. Although the estimation error of
the averaged estimator is large compared to that of the centralized lasso,
the thresholded averaged estimator performs comparably to, or even better
than, the centralized lasso.

4. A distributed approach to debiasing. The averaged estimator
we studied has the form

$\bar\beta = \frac{1}{m}\sum_{k=1}^m \left(\hat\beta_k + \frac{1}{n}\hat\Theta_k X_k^T(y_k - X_k\hat\beta_k)\right).$

The estimator requires each machine to form $\hat\Theta_k$ by the solution of
(2.2). Since the dual of (2.2) is an $\ell_1$-regularized quadratic program:

(4.1) minimize - t k 7 + 5 IItIU , 

7ER p z 

forming $\hat\Theta_k$ is (roughly speaking) $p$ times as expensive as
solving the local lasso problem, making it the most expensive step (in terms
of FLOPS) of evaluating the averaged estimator. To trim the cost of the
debiasing step, we consider an estimator that forms only a single
$\hat\Theta$:

(4.2) $\tilde\beta = \frac{1}{m}\sum_{k=1}^m \hat\beta_k + \hat\Theta\,\frac{1}{N}\sum_{k=1}^m X_k^T(y_k - X_k\hat\beta_k).$

To evaluate (4.2),

1. each machine sends $\hat\beta_k$ and $\frac{1}{n}X_k^T(y_k - X_k\hat\beta_k)$ to a central server,

2. the central server forms $\frac{1}{m}\sum_{k=1}^m \hat\beta_k$ and
$\frac{1}{N}\sum_{k=1}^m X_k^T(y_k - X_k\hat\beta_k)$ and sends the averages
to all the machines,

3. each machine, given the averages, forms $\frac{p}{m}$ rows of $\hat\Theta$
and debiases $\frac{p}{m}$ coefficients:

$\tilde\beta_j = \frac{1}{m}\sum_{k=1}^m \hat\beta_{k,j} + \hat\Theta_{j,\cdot}\left(\frac{1}{N}\sum_{k=1}^m X_k^T(y_k - X_k\hat\beta_k)\right),$

where $\hat\Theta_{j,\cdot} \in \mathbb{R}^p$ is a row vector.

As we shall see, each machine can perform debiasing with only the data
stored locally. Thus, forming the estimator (4.2) requires two rounds of
communication.
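The three steps above can be sketched as a single-process simulation in which a Python list plays the role of the machines. This is an illustration under simplifying assumptions, not the paper's implementation: each machine has $n > p$ samples, and the rows of $\hat\Theta$ are taken from the inverse of the local sample covariance as a stand-in for the nodewise-regression rows discussed next. The lasso solver and all sizes are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/2n)||y - X b||_2^2 + lam * ||b||_1."""
    n, p = X.shape
    beta, resid = np.zeros(p), y.copy()
    col_norm = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ resid / n, lam) / col_norm[j]
            resid -= X[:, j] * beta[j]
    return beta

rng = np.random.default_rng(5)
m, n, p, lam = 4, 400, 8, 0.5                      # hypothetical problem sizes
beta_star = np.zeros(p)
beta_star[:3] = 2.0
data = []
for _ in range(m):
    X = rng.standard_normal((n, p))
    data.append((X, X @ beta_star + rng.standard_normal(n)))

# Round 1: every machine sends its lasso estimate and its correction vector.
messages = []
for X, y in data:
    b = lasso_cd(X, y, lam)
    messages.append((b, X.T @ (y - X @ b) / n))
beta_avg = np.mean([b for b, _ in messages], axis=0)
grad_avg = np.mean([g for _, g in messages], axis=0)  # (1/N) sum_k X_k^T (y_k - X_k b_k)

# Round 2: machine k debiases its own block of about p/m coefficients.
beta_tilde = np.empty(p)
for k, block in enumerate(np.array_split(np.arange(p), m)):
    X_k = data[k][0]
    Theta_rows = np.linalg.inv(X_k.T @ X_k / n)[block]  # assumption: stand-in rows
    beta_tilde[block] = beta_avg[block] + Theta_rows @ grad_avg

err_naive = np.abs(beta_avg - beta_star).max()
err_tilde = np.abs(beta_tilde - beta_star).max()
print(f"averaged lasso: {err_naive:.3f}, estimator (4.2): {err_tilde:.3f}")
```

Each machine transmits two vectors of length $p$ in the first round and roughly $\frac{p}{m}$ numbers in the second, so the communication cost does not depend on $n$.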

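The question that remains is how to form $\hat\Theta_{j,\cdot}$. We consider
an estimator proposed by van de Geer et al. (2013): nodewise regression on
the predictors. For some $j \in [p]$ that machine $k$ is debiasing, the
machine solves

$\hat\gamma_j := \arg\min_{\gamma\in\mathbb{R}^{p-1}} \frac{1}{2n}\|X_{k,j} - X_{k,-j}\gamma\|_2^2 + \lambda_j\|\gamma\|_1, \quad j \in [p],$

where $X_{k,-j} \in \mathbb{R}^{n\times(p-1)}$ is $X_k$ less its $j$-th
column $X_{k,j}$. Implicitly, we are forming

$\hat{C} := \begin{bmatrix} 1 & -\hat\gamma_{1,2} & \cdots & -\hat\gamma_{1,p} \\ -\hat\gamma_{2,1} & 1 & \cdots & -\hat\gamma_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ -\hat\gamma_{p,1} & -\hat\gamma_{p,2} & \cdots & 1 \end{bmatrix},$

where the components of $\hat\gamma_j$ are indexed by
$k \in \{1, \dots, j-1, j+1, \dots, p\}$. We scale the rows of $\hat{C}$ by
$\mathrm{diag}([\hat\tau_1^{-2}, \dots, \hat\tau_p^{-2}])$, where

$\hat\tau_j = \left(\frac{1}{n}\|X_j - X_{-j}\hat\gamma_j\|_2^2 + \lambda_j\|\hat\gamma_j\|_1\right)^{1/2},$

to form $\hat\Theta = \hat{T}^{-2}\hat{C}$. Each row of $\hat\Theta$ is given
by

(4.3) $\hat\Theta_{j,\cdot} = \frac{1}{\hat\tau_j^2}\begin{bmatrix} -\hat\gamma_{j,1} & \cdots & -\hat\gamma_{j,j-1} & 1 & -\hat\gamma_{j,j+1} & \cdots & -\hat\gamma_{j,p} \end{bmatrix}.$

Since $\hat\gamma_j$ and $\hat\tau_j$ only depend on $X_k$, they can be formed
without any communication.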

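Before we justify the choice of $\hat\Theta$ theoretically, we mention that
it is an approximate “inverse” of $\hat\Sigma$ (in a component-wise sense).
By the optimality conditions of nodewise regression,

$\hat\tau_j^2 = \frac{1}{n}\|X_j - X_{-j}\hat\gamma_j\|_2^2 + \lambda_j\|\hat\gamma_j\|_1$
$\quad = \frac{1}{n}\|X_j - X_{-j}\hat\gamma_j\|_2^2 + \frac{1}{n}(X_j - X_{-j}\hat\gamma_j)^TX_{-j}\hat\gamma_j$
$\quad = \frac{1}{n}X_j^T(X_j - X_{-j}\hat\gamma_j).$

Recalling the definition of $\hat\Theta$, we have

$\hat\Theta_{j,\cdot}\hat\Sigma e_j = \frac{1}{\hat\tau_j^2 n}(X_j - X_{-j}\hat\gamma_j)^TX_j = 1$, and

$\|\hat\Theta_{j,\cdot}\hat\Sigma_{\cdot,-j}\|_\infty = \frac{1}{\hat\tau_j^2}\left\|\frac{1}{n}(X_j - X_{-j}\hat\gamma_j)^TX_{-j}\right\|_\infty \le \frac{\lambda_j}{\hat\tau_j^2}$

for any $j \in [p]$. Thus

(4.4) $\max_{j\in[p]}\|\hat\Theta_{j,\cdot}\hat\Sigma - e_j\|_\infty \le \max_{j\in[p]}\frac{\lambda_j}{\hat\tau_j^2}.$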

van de Geer et al. (2013) show that when the rows of $X$ are i.i.d.
subgaussian random vectors and the precision matrix $\Sigma^{-1}$ is sparse,
$\hat\Theta_{j,\cdot}$ converges to $\Sigma^{-1}_{j,\cdot}$ at the usual
convergence rate of the lasso. For completeness, we restate their result.

We consider a sequence of regression problems indexed by the sample size
$N$, dimension $p$, and sparsity $s_0$ that satisfy (A1), (A2), and (A3). As
$N$ grows to infinity, both $p = p(N)$ and $s = s(N)$ may also grow as a
function of $N$. To keep notation manageable, we drop the index $N$. We
further assume


(A4) the covariance of the predictors (rows of $X$) has smallest eigenvalue
$\lambda_{\min}(\Sigma) \sim \Omega(1)$ and largest diagonal entry
$\max_{j\in[p]}\Sigma_{j,j} \sim O(1)$,

(A5) the rows of $\Sigma^{-1}$ are sparse:
$\max_{j\in[p]}\frac{s_j\log p}{n} \sim o(1)$, where $s_j$ is the sparsity of
$\Sigma^{-1}_{j,\cdot}$.


Lemma 4.1 (van de Geer et al. (2013), Theorem 2.4). Under (A1)-(A5),
(4.3) with suitable parameters $\lambda_j \sim (\frac{\log p}{n})^{1/2}$
satisfies

$\|\hat\Theta_{j,\cdot} - \Sigma^{-1}_{j,\cdot}\|_1 \lesssim_P s_j\left(\frac{\log p}{n}\right)^{1/2}$ for any $j \in [p]$.
We show that the averaged estimator (4.2) matches the convergence rate
of the centralized lasso.


Theorem 4.2. Under (A1)-(A5), (4.2), where $\hat\Theta$ is given by (4.3),
with suitable parameters $\lambda_j, \lambda_k \sim (\frac{\log p}{n})^{1/2}$,
$j \in [p]$, $k \in [m]$ satisfies

$\|\tilde\beta - \beta^*\|_\infty \lesssim_P \left(\frac{\log p}{N}\right)^{1/2} + \frac{s_{\max}\log p}{n},$

where $s_{\max} := \max\{s_0, s_1, \dots, s_p\}$.


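Proof. We start by substituting the linear model into (4.2):

$\tilde\beta = \frac{1}{m}\sum_{k=1}^m \left(\hat\beta_k - \hat\Theta\hat\Sigma_k(\hat\beta_k - \beta^*) + \frac{1}{n}\hat\Theta X_k^T\epsilon_k\right)$
$\quad = \frac{1}{m}\sum_{k=1}^m \left(\hat\beta_k - \hat\Theta\hat\Sigma_k(\hat\beta_k - \beta^*)\right) + \frac{1}{N}\hat\Theta X^T\epsilon.$

Subtracting $\beta^*$ and taking norms, we obtain

(4.5) $\|\tilde\beta - \beta^*\|_\infty \le \frac{1}{m}\sum_{k=1}^m \|(I - \hat\Theta\hat\Sigma_k)(\hat\beta_k - \beta^*)\|_\infty + \left\|\frac{1}{N}\hat\Theta X^T\epsilon\right\|_\infty.$

By Vershynin (2010), Proposition 5.16, and Lemma 3.3, it is possible to
show that

$\left\|\frac{1}{N}\hat\Theta X^T\epsilon\right\|_\infty \lesssim_P \left(\frac{\log p}{N}\right)^{1/2}.$

We turn our attention to the first term in (4.5). It is straightforward to
see each term in the sum is bounded by

$\|(I - \hat\Theta\hat\Sigma_k)(\hat\beta_k - \beta^*)\|_\infty$
$\quad \le \|(I - \Sigma^{-1}\hat\Sigma_k)(\hat\beta_k - \beta^*)\|_\infty + \|(\Sigma^{-1} - \hat\Theta)\hat\Sigma_k(\hat\beta_k - \beta^*)\|_\infty$
$\quad \le \max_{j\in[p]}\|e_j - \Sigma^{-1}_{j,\cdot}\hat\Sigma_k\|_\infty\|\hat\beta_k - \beta^*\|_1 + \max_{j\in[p]}\|\Sigma^{-1}_{j,\cdot} - \hat\Theta_{j,\cdot}\|_1\|\hat\Sigma_k(\hat\beta_k - \beta^*)\|_\infty.$

We put the pieces together to deduce each term is
$O_P(\frac{s_{\max}\log p}{n})$:

1. By Lemmas 2.4, 2.6, 3.4, $\|\hat\beta_k - \beta^*\|_1 \lesssim_P s_0\lambda_k$.

2. By Lemma 4.1, $\|\Sigma^{-1}_{j,\cdot} - \hat\Theta_{j,\cdot}\|_1 \lesssim_P s_j(\frac{\log p}{n})^{1/2}$.

3. By the triangle inequality,

$\|\hat\Sigma_k(\hat\beta_k - \beta^*)\|_\infty \le \left\|\frac{1}{n}X_k^T(y_k - X_k\hat\beta_k)\right\|_\infty + \left\|\frac{1}{n}X_k^T\epsilon_k\right\|_\infty.$

By the optimality conditions of the (local) lasso estimators, the first
term is at most $\lambda_k$, and it is possible to show, by Lemma 3.3 and
Vershynin (2010), Proposition 5.16, that the second term is
$O_P((\frac{\log p}{n})^{1/2})$.

Since $\lambda_k \sim (\frac{\log p}{n})^{1/2}$, by a union bound over
$k \in [m]$, we obtain

$\|\tilde\beta - \beta^*\|_\infty \sim O_P\left(\left(\frac{\log p}{N}\right)^{1/2} + \frac{s_{\max}\log p}{n}\right),$

where $s_{\max} := \max\{s_0, s_1, \dots, s_p\}$. □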


By combining Lemma 3.5 with Theorem 4.2, we can show that
$\tilde\beta^{ht} := HT(\tilde\beta, t)$ for an appropriate threshold $t$
converges to $\beta^*$ at the same rates as the centralized lasso.


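Theorem 4.3. Under the conditions of Theorem 4.2, hard-thresholding
$\tilde\beta$ at $t \sim (\frac{\log p}{N})^{1/2} + \frac{s_{\max}\log p}{n}$
gives

1. $\|\tilde\beta^{ht} - \beta^*\|_\infty \lesssim_P (\frac{\log p}{N})^{1/2} + \frac{s_{\max}\log p}{n}$,

2. $\|\tilde\beta^{ht} - \beta^*\|_2 \lesssim_P (\frac{s_0\log p}{N})^{1/2} + \frac{\sqrt{s_0}\,s_{\max}\log p}{n}$,

3. $\|\tilde\beta^{ht} - \beta^*\|_1 \lesssim_P (\frac{s_0^2\log p}{N})^{1/2} + \frac{s_0 s_{\max}\log p}{n}$.

Theorem 4.3 shows that for $m \lesssim \frac{n}{s_{\max}^2\log p}$, the
variance term is dominant, so the convergence rates simplify:

1. $\|\tilde\beta^{ht} - \beta^*\|_\infty \lesssim_P (\frac{\log p}{N})^{1/2}$,

2. $\|\tilde\beta^{ht} - \beta^*\|_2 \lesssim_P (\frac{s_0\log p}{N})^{1/2}$,

3. $\|\tilde\beta^{ht} - \beta^*\|_1 \lesssim_P (\frac{s_0^2\log p}{N})^{1/2}$.

Thus, the estimator $\tilde\beta^{ht}$ shares the advantages of
$\bar\beta^{ht}$ over the centralized lasso (cf. Remark 3.7). It also
achieves computational gains over $\bar\beta^{ht}$ by amortizing the cost of
debiasing across $m$ machines.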


5. Averaging debiased $\ell_1$ regularized M-estimators. The distributed
approach to debiasing extends readily to regularized M-estimators.
As before, we are given $N$ pairs $(x_i, y_i)$ stored on $m$ machines. Let
$\rho(y_i, a)$



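be a loss function, which is convex in $a$, and $\dot\rho$, $\ddot\rho$ be
its derivatives with respect to $a$. That is

$\dot\rho(y, a) = \frac{d}{da}\rho(y, a), \quad \ddot\rho(y, a) = \frac{d^2}{da^2}\rho(y, a).$

We define $\ell_k(\beta) = \frac{1}{n}\sum_{i=1}^n \rho(y_i, x_i^T\beta)$,
where the sum is only over the pairs on machine $k$. The averaged estimator is

(5.1) $\tilde\beta := \frac{1}{m}\sum_{k=1}^m \hat\beta_k - \hat\Theta\left(\frac{1}{m}\sum_{k=1}^m \nabla\ell_k(\hat\beta_k)\right),$

where $\hat\beta_k$ is the local $\ell_1$ regularized M-estimator:
$\hat\beta_k := \arg\min_{\beta\in\mathbb{R}^p} \ell_k(\beta) + \lambda_k\|\beta\|_1$.
As before, we form $\hat\Theta$ by nodewise regression on the weighted design
matrix $X_{\hat\beta_k} := W_{\hat\beta_k}X_k$, where $W_{\hat\beta_k}$ is
diagonal and its diagonal entries are

$(W_{\hat\beta_k})_{i,i} := \ddot\rho(y_i, x_i^T\hat\beta_k)^{1/2}.$

That is, for some $j \in [p]$ that machine $k$ is debiasing, the machine solves

$\hat\gamma_j := \arg\min_{\gamma\in\mathbb{R}^{p-1}} \frac{1}{2n}\|X_{\hat\beta_k,j} - X_{\hat\beta_k,-j}\gamma\|_2^2 + \lambda_j\|\gamma\|_1, \quad j \in [p],$

and forms

$\hat\Theta_{j,\cdot} = \frac{1}{\hat\tau_j^2}\begin{bmatrix} -\hat\gamma_{j,1} & \cdots & -\hat\gamma_{j,j-1} & 1 & -\hat\gamma_{j,j+1} & \cdots & -\hat\gamma_{j,p} \end{bmatrix},$

where

$\hat\tau_j = \left(\frac{1}{n}\|X_{\hat\beta_k,j} - X_{\hat\beta_k,-j}\hat\gamma_j\|_2^2 + \lambda_j\|\hat\gamma_j\|_1\right)^{1/2}.$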
We assume 

(Bl) the pairs {(a?*, 2/i)}*e[ N] are i-i.d.; the predictors are bounded: 

max ?e[iV] 11 Xi | |oo iS 1) 

the projection of Xp* 3 on TZ(Xp*-j) in the E [V 2 4(/3*)] inner product 
is bounded: \\Xp*-j r yp* t j\\ 00 < 1 for any j E [p], where 

7/3* J : = a rgminE [||A>j - Ap* -j7lll] • 

7GRP - 1 

(B2) the rows of E [V 2 4(/3*)] 1 are sparse: maxj g [p] J n SP ~ o(l), where 
Sj is the sparsity of (E [V 2 4(/l*)] X ) ... 
(B3) the smallest eigenvalue of $E[\nabla^2\ell_k(\beta^*)]$ is bounded away
from zero and its entries are bounded.

(B4) for any (3 such that \\/3 — f3*\\i < <5 for some 5 > 0, the diagonal entries 
of Wp stays away from zero, and 

\p(y,x T /3) -p(y,x T /3*)\ < \x T (f3 - /3*)\. 

(B5) we have ±\\X k 0 k - /3*)\\\ <p s 0 X 2 k and \\j3 k - /?*||i <p s 0 X k . 

(B 6 ) the derivatives p(y,a), p(y,a) is locally Lipschitz: 

max ie[ v] supi^^t^i^ sup„ ]piy, f a Z P J y,a ' )l < K for some <5 > 0 . 
Further, 

maxj g [jv] sup y \p(y, xj f3) | ~ 0(1), 
maxj g[ 7 v] su P| 0 _ x t / 3»|< ( 5 sup y \p(y, a)| ~ 0 ( 1 ). 

(B7) the diagonal entries of 

E[v 2 4(r)] _1 E[V4(/3*)V4(/3*) t ] E[v 2 4(r)] 

are bounded. 

Assumption (B5) is not necessary; it is implied by the other assumptions.
We refer to Bühlmann and van de Geer (2011), Chapter 6 for the details.
Here we state it as an assumption to simplify the exposition. We show the
averaged estimator (5.1) achieves the convergence rate of the centralized
$\ell_1$-regularized M-estimator.

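Theorem 5.1. Under (B1)-(B7), (5.1) with suitable parameters
$\lambda_j, \lambda_k \sim (\frac{\log p}{n})^{1/2}$, $j \in [p]$, $k \in [m]$
satisfies

(5.2) $\|\tilde\beta - \beta^*\|_\infty \lesssim_P \left(\frac{\log p}{N}\right)^{1/2} + \frac{s_{\max}\log p}{n},$

where $s_{\max} := \max\{s_0, s_1, \dots, s_p\}$.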


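Proof. The averaged estimator is given by

$\tilde\beta - \beta^* = \frac{1}{m}\sum_{k=1}^m \left(\hat\beta_k - \hat\Theta\nabla\ell_k(\hat\beta_k)\right) - \beta^*.$

By the smoothness of $\rho$,

$\dot\rho(y_i, x_i^T\hat\beta_k) = \dot\rho(y_i, x_i^T\beta^*) + \ddot\rho(y_i, a_i)x_i^T(\hat\beta_k - \beta^*),$

where $a_i$ is a point between $x_i^T\hat\beta_k$ and $x_i^T\beta^*$. Thus

$\tilde\beta - \beta^* = \frac{1}{m}\sum_{k=1}^m \left(\hat\beta_k - \hat\Theta(\nabla\ell_k(\beta^*) + Q_k(\hat\beta_k - \beta^*))\right) - \beta^*$
$\quad = -\hat\Theta\left(\frac{1}{m}\sum_{k=1}^m \nabla\ell_k(\beta^*)\right) + \frac{1}{m}\sum_{k=1}^m (I - \hat\Theta Q_k)(\hat\beta_k - \beta^*),$

where $Q_k = \frac{1}{n}\sum_{i=1}^n \ddot\rho(y_i, a_i)x_ix_i^T$, where the
sum is over the data points on machine $k$. Taking norms, we obtain

$\|\tilde\beta - \beta^*\|_\infty \le \left\|\hat\Theta\left(\frac{1}{m}\sum_{k=1}^m \nabla\ell_k(\beta^*)\right)\right\|_\infty + \frac{1}{m}\sum_{k=1}^m \|(I - \hat\Theta Q_k)(\hat\beta_k - \beta^*)\|_\infty.$

It is possible to show that
$\|\hat\Theta(\frac{1}{m}\sum_{k=1}^m \nabla\ell_k(\beta^*))\|_\infty \lesssim_P (\frac{\log p}{N})^{1/2}$,
which corresponds to the first term in (5.2). We refer to Bühlmann and van
de Geer (2011), Chapter 6 for the details.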

We turn our attention to the second term. By the triangle inequality, 

||(/ — QQk)0k ~ /?*)||oo 

< || (/ - ©v 2 4c4)) 0 k - r)L + ||0 (v 2 4 0 k ) - Qk) 0 k - /niL 

< maxj g [ p ]||ej - Qj,.V 2 £k0k)\\ oo \\Pk - P*\\i 

1 n 

+ \\® x i\\oo\p{yiixfPk) - p{yi,a,i)x[ 0 k ~ P*)\- 


We proceed term by term. By (4.4), 

max ;e[p] M - 0j,-V 2 4(/3 fe )L < ^ ^ 


By van de Geer et al. (2013), Theorem 3.2, 

'max{s 0 , Sj} logp^ \ 


n 


1 

2 


Thus maxjg^] ||ej 

m aXj ^ jp] 


- &j,V 2 4(&)L Sp t^) 1 and, by (B5), 

\ej - %,v 2 4(ft)LiiA - nil <p 
We turn our attention to the second term. We have $\|\hat\Theta x_i\|_\infty \lesssim_P 1$ because
||0Xi||oo < max je [p] || 0J wjlloo < maxj g [p] ||©j||oo 
< max^-gfp] T2||(X fc)/9 .)j - (X fct/ 9.)_ J -7 i || 0O . 

T j 

Again, by van de Geer et al. (2013), Theorem 3.2, 


1 


rsjP maX j'e[p] 2 II ~ (^fc,/3*)-j7? l|oO 

T j 

1 

<p max jeW ^||(^fc,/3*)j - ( x k,p*)-j'Yj\\oo 
T i 

^ 2 11 (^fc,/3* ) j 11 oo 11 (7j — 7j ) 111 • 


which, by (Bl) and van de Geer et al. (2013), Theorem 3.2, 

sjlogp 


1 + 


n 


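Thus

$$\frac{1}{n}\sum_{i=1}^n \|\hat\Theta x_i\|_\infty\,|\ddot\rho(y_i, x_i^T\hat\beta_k) - \ddot\rho(y_i, \tilde a_i)|\,|x_i^T(\hat\beta_k - \beta^*)|
\lesssim_P \frac{1}{n}\sum_{i=1}^n |\ddot\rho(y_i, x_i^T\hat\beta_k) - \ddot\rho(y_i, \tilde a_i)|\,|x_i^T(\hat\beta_k - \beta^*)|,$$

which, by (B5) and (B6), is at most a constant multiple of

$$\frac{1}{n}\|X_k(\hat\beta_k - \beta^*)\|_2^2 \lesssim_P \frac{s_0\log p}{n}.$$

We put the pieces together to deduce

$$\frac{1}{m}\sum_{k=1}^m \|(I - \hat\Theta Q_k)(\hat\beta_k - \beta^*)\|_\infty \lesssim_P \frac{s_{\max}\log p}{n}. \qquad \square$$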


By combining Lemma 3.5 with Theorem 4.2, we can show that $\bar\beta^{ht} :=
\mathrm{HT}(\bar\beta, t)$ for an appropriate threshold $t$ converges to $\beta^*$ at the same rates as
the centralized $\ell_1$-regularized M-estimator.


Theorem 5.2. Under the conditions of Theorem 5.1, hard-thresholding
$\bar\beta$ at $t \sim \big(\frac{\log p}{N}\big)^{1/2} + \frac{s_{\max}\log p}{n}$ gives

1. $\|\bar\beta^{ht} - \beta^*\|_\infty \lesssim_P \big(\frac{\log p}{N}\big)^{1/2} + \frac{s_{\max}\log p}{n}$,

2. $\|\bar\beta^{ht} - \beta^*\|_2 \lesssim_P \big(\frac{s_0\log p}{N}\big)^{1/2} + \frac{\sqrt{s_0}\,s_{\max}\log p}{n}$,

3. $\|\bar\beta^{ht} - \beta^*\|_1 \lesssim_P \big(\frac{s_0^2\log p}{N}\big)^{1/2} + \frac{s_0\max_{j\in[p]} s_j\log p}{n}$.
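
To make the thresholding step in Theorem 5.2 concrete, the following Python sketch zeroes every coordinate of a vector whose magnitude does not exceed a threshold $t$. The function name `hard_threshold` and the convention that ties ($|b| = t$) are set to zero are assumptions made for this illustration; they are not definitions taken from the paper.

```python
def hard_threshold(beta, t):
    """Return a copy of beta with entries of magnitude <= t set to zero.

    Assumption: ties (|b| == t) are zeroed; the HT operator in the text
    may use the opposite convention.
    """
    return [b if abs(b) > t else 0.0 for b in beta]


# Toy vector standing in for an averaged estimator: two large
# coordinates plus small noise-level coordinates.
beta_bar = [2.1, -0.05, 0.02, -1.4, 0.08]
beta_ht = hard_threshold(beta_bar, t=0.1)
```

In practice the threshold would be chosen on the order of the $\ell_\infty$ rate in the theorem, so that noise-level coordinates are removed while coordinates of the true support survive.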


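Assuming $s_0 \sim s_{\max}$, Theorem 5.2 shows when $m \lesssim \frac{n}{s_0^2\log p}$, the variance
term is dominant, so the convergence rates simplify to

1. $\|\bar\beta^{ht} - \beta^*\|_\infty \lesssim_P \big(\frac{\log p}{N}\big)^{1/2}$,

2. $\|\bar\beta^{ht} - \beta^*\|_2 \lesssim_P \big(\frac{s_0\log p}{N}\big)^{1/2}$,

3. $\|\bar\beta^{ht} - \beta^*\|_1 \lesssim_P \big(\frac{s_0^2\log p}{N}\big)^{1/2}$.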
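
The condition on $m$ can be checked by comparing the variance and bias terms of the $\ell_\infty$ rate. The short derivation below is a sketch under two assumptions: $s_{\max} \sim s_0$ and an even split of the data, so that $N = mn$.

```latex
\Big(\frac{\log p}{mn}\Big)^{1/2} \gtrsim \frac{s_0 \log p}{n}
\iff \frac{\log p}{mn} \gtrsim \frac{s_0^2 \log^2 p}{n^2}
\iff m \lesssim \frac{n}{s_0^2 \log p}.
```

Multiplying both sides of the first inequality by $\sqrt{s_0}$ or by $s_0$ shows that the same condition makes the variance term dominant in the $\ell_2$ and $\ell_1$ rates as well.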


6. Summary and discussion. We devised a communication-efficient
approach to distributed sparse regression in the high-dimensional setting.
The key idea is first “debiasing” local lasso estimators, and then averaging
the debiased estimators. We show that as long as the data is not split
across too many machines, the averaged estimator achieves the convergence
rate of the centralized lasso estimator. In the appendix, we show that by
forgoing consistency in the $\ell_\infty$ norm, it is possible to further reduce the
sample complexity of the averaged estimator to that of the centralized lasso
estimator. Further, the distributed approach to debiasing extends readily
to other $\ell_1$-regularized M-estimators. In concurrent work, the approach of
averaging debiased M-estimators was proposed by Battey et al. (2015) for
high-dimensional inference.
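
The debias-then-average recipe summarized above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the local lasso is solved by plain cyclic coordinate descent, and a single matrix `M` (here the identity, which is only appropriate when the design has identity covariance) stands in for the per-machine estimates of the inverse covariance.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Lasso by cyclic coordinate descent on (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    for _ in range(n_iter):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            scale = sum(c * c for c in col) / n
            if scale == 0.0:
                continue
            # correlation of column j with the partial residual
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new_bj = soft_threshold(rho, lam) / scale
            delta = new_bj - beta[j]
            resid = [r - c * delta for c, r in zip(col, resid)]
            beta[j] = new_bj
    return beta


def debias(X, y, beta_hat, M):
    """Debiased estimate beta_hat + (1/n) M X^T (y - X beta_hat)."""
    n, p = len(X), len(X[0])
    resid = [y[i] - sum(X[i][j] * beta_hat[j] for j in range(p)) for i in range(n)]
    grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
    return [beta_hat[j] + sum(M[j][l] * grad[l] for l in range(p)) for j in range(p)]


def one_shot_average(machines, lam, M):
    """Each machine debiases its local lasso; the master averages the results."""
    local = [debias(X, y, lasso_cd(X, y, lam), M) for X, y in machines]
    p = len(local[0])
    return [sum(b[j] for b in local) / len(local) for j in range(p)]


# Toy check: two machines, orthogonal designs ((1/n) X^T X = I), no noise.
beta_star = [2.0, 0.0]
X1 = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
X2 = [[1.0, -1.0], [1.0, 1.0], [-1.0, -1.0], [-1.0, 1.0]]
machines = [(X, [sum(x * b for x, b in zip(row, beta_star)) for row in X])
            for X in (X1, X2)]
identity = [[1.0, 0.0], [0.0, 1.0]]  # assumption: inverse covariance is I
local_lasso = lasso_cd(X1, machines[0][1], lam=0.5)
beta_bar = one_shot_average(machines, lam=0.5, M=identity)
```

In this noiseless toy example the local lasso shrinks the first coefficient by the regularization parameter, while the debiasing correction removes that shrinkage before averaging. Only the $p$-dimensional debiased vectors are sent to the master, which is the source of the communication savings.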

In recent years, there has been a flurry of work on establishing
communication lower bounds for mean estimation in the Gaussian distribution.
These works establish the minimum communication $C$ needed to obtain
$\ell_2^2$ risk $R$, where $\|\hat\beta - \beta^*\|_2^2 \le R$ (Duchi et al., 2014; Garg, Ma and Nguyen,
2014). These results are not directly applicable to sparse linear regression,
since they do not impose sparsity on the mean. In Braverman et al. (2015),
the authors established that to obtain risk R < sl( ^ p at least 0( m ) 
bits of communication is required. Our approach communicates 0(mp) bits 
to achieve risk of , so is communication-optimal when p < n. 


APPENDIX A: PROOFS OF LEMMAS 

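Proof of Lemma 2.2. Let $z_i = \Sigma^{-1/2}x_i$. The generalized coherence
between $X$ and $\Sigma^{-1}$ is given by

$$\|\Sigma^{-1}\hat\Sigma - I\|_\infty = \Big\|\frac{1}{n}\sum_{i=1}^n (\Sigma^{-1/2}z_i)(\Sigma^{1/2}z_i)^T - I\Big\|_\infty.$$

Each entry of $\frac{1}{n}\sum_{i=1}^n (\Sigma^{-1/2}z_i)(\Sigma^{1/2}z_i)^T - I$ is a sum of independent
subexponential random variables. Their subexponential norms are bounded by

$$\|(\Sigma^{-1/2}z_i)_j(\Sigma^{1/2}z_i)_k - \delta_{jk}\|_{\psi_1} \le 2\|(\Sigma^{-1/2}z_i)_j(\Sigma^{1/2}z_i)_k\|_{\psi_1}.$$

Recall for any two subgaussian random variables $X$, $Y$, we have

$$\|XY\|_{\psi_1} \le 2\|X\|_{\psi_2}\|Y\|_{\psi_2}.$$

Thus

$$\|(\Sigma^{-1/2}z_i)_j(\Sigma^{1/2}z_i)_k - \delta_{jk}\|_{\psi_1} \le 4\|(\Sigma^{-1/2}z_i)_j\|_{\psi_2}\|(\Sigma^{1/2}z_i)_k\|_{\psi_2} \le 4\sqrt{\kappa}\,\sigma_x^2,$$

where $\sigma_x = \|z_i\|_{\psi_2}$. By a Bernstein-type inequality,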


1 

Pr(— y^(E~ 2 ^)j(E 2 2 ; i ) fc - S jk >t)< 2e 


-ci minl^,^} 


2—1 


where ci > 0 is a universal constant and <5^ := 4 -^/kcj^. Since cr^n > logp, 
we set t = Tbi (Esp) 2 obtain 

a/ci \ n ) 


Pr ( - 

n 


2—1 ^ 


□ 


We obtain the stated result by taking a union bound over the p 2 entries of 

££?=!(S-Szi)(E^) r -/. ' 

Proof of Lemma 2.5. By Vershynin (2010), Proposition 5.10,

$$\Pr\Big(\frac{1}{n}|x_j^T\epsilon| > t\Big) \le e\exp\Big(-\frac{c_2 n^2 t^2}{\sigma_y^2\|x_j\|_2^2}\Big) \le e\exp\Big(-\frac{c_2 n t^2}{\sigma_y^2\max_{j\in[p]}\hat\Sigma_{j,j}}\Big),$$

where the second inequality uses $\|x_j\|_2^2 = n\hat\Sigma_{j,j}$. We take a union bound over the $p$ components of $\frac{1}{n}X^T\epsilon$ to obtain

$$\Pr\Big(\frac{1}{n}\|X^T\epsilon\|_\infty > t\Big) \le e\exp\Big(-\frac{c_2 n t^2}{\sigma_y^2\max_{j\in[p]}\hat\Sigma_{j,j}} + \log p\Big).$$

We set $\lambda = \max_{j\in[p]}\hat\Sigma_{j,j}^{1/2}\,\sigma_y\big(\frac{3\log p}{c_2 n}\big)^{1/2}$ to obtain the desired conclusion. □

Proof of Lemma 2.7. We start by substituting the linear model into
(2.1):

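$$\hat\beta^d = \hat\beta + \frac{1}{n}MX^T(y - X\hat\beta) = \hat\beta + M\hat\Sigma(\beta^* - \hat\beta) + \frac{1}{n}MX^T\epsilon.$$

By adding and subtracting $\beta^* - \hat\beta$, we obtain

$$\hat\beta^d = \hat\beta + \frac{1}{n}MX^T(y - X\hat\beta) = \beta^* + (M\hat\Sigma - I)(\beta^* - \hat\beta) + \frac{1}{n}MX^T\epsilon.$$

We obtain the expression of $\hat\beta^d$ by setting $\hat\Delta = (M\hat\Sigma - I)(\beta^* - \hat\beta)$.

To show $\|\hat\Delta\|_\infty \le \frac{3\delta}{\mu_l}s\lambda$, we apply Hölder's inequality to each component
of $\hat\Delta$ to obtain

$$\|(M\hat\Sigma - I)(\beta^* - \hat\beta)\|_\infty \le \max_j \|\hat\Sigma m_j^T - e_j\|_\infty\,\|\hat\beta - \beta^*\|_1 \le \delta\|\hat\beta - \beta^*\|_1, \tag{A.1}$$

where $\delta$ is the generalized incoherence between $X$ and $M$. By Lemma 2.6,
$\|\hat\beta - \beta^*\|_1 \le \frac{3}{\mu_l}s\lambda$. We combine the bound on $\|\hat\beta - \beta^*\|_1$ with (A.1) to obtain
the stated bound on $\|\hat\Delta\|_\infty$. □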


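Proof of Lemma 3.1. By Lemma 2.7,

$$\bar\beta - \beta^* = \frac{1}{N}\sum_{k=1}^m \hat\Theta_k X_k^T\epsilon_k + \frac{1}{m}\sum_{k=1}^m \hat\Delta_k.$$

We take norms to obtain

$$\|\bar\beta - \beta^*\|_\infty \le \Big\|\frac{1}{N}\sum_{k=1}^m \hat\Theta_k X_k^T\epsilon_k\Big\|_\infty + \frac{1}{m}\sum_{k=1}^m \|\hat\Delta_k\|_\infty.$$

We focus on bounding the first term. Let $a_j^T := e_j^T[\hat\Theta_1 X_1^T \;\dots\; \hat\Theta_m X_m^T]$.
By Vershynin (2010), Proposition 5.10,

$$\Pr\Big(\frac{1}{N}|a_j^T\epsilon| > t\Big) \le e\exp\Big(-\frac{c_2 N^2 t^2}{\sigma_y^2\|a_j\|_2^2}\Big)$$

for some universal constant $c_2 > 0$. Further,

$$\|a_j\|_2^2 = \sum_{k=1}^m \|X_k\hat\Theta_k^T e_j\|_2^2 = n\sum_{k=1}^m (\hat\Theta_k\hat\Sigma_k\hat\Theta_k^T)_{j,j} \le c_\Omega N,$$

where $c_\Omega := \max_{j\in[p],\,k\in[m]} (\hat\Theta_k\hat\Sigma_k\hat\Theta_k^T)_{j,j}$. By a union bound over $j \in [p]$,

$$\Pr\Big(\max_{j\in[p]}\frac{1}{N}|a_j^T\epsilon| > t\Big) \le e\exp\Big(-\frac{c_2 N t^2}{c_\Omega\sigma_y^2} + \log p\Big).$$

We set $t = \sigma_y\big(\frac{2c_\Omega\log p}{c_2 N}\big)^{1/2}$ to deduce

$$\Pr\Big(\max_{j\in[p]}\frac{1}{N}|a_j^T\epsilon| > \sigma_y\Big(\frac{2c_\Omega\log p}{c_2 N}\Big)^{1/2}\Big) \le ep^{-1}.$$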

We turn our attention to bounding the second term. By Lemma 2.5 and 
a union bound over j E \p ], when we set 


Ai — • • 


A/, 


= A := 


maxj £ [p],fce [m] 



1 

2 


we have ^H-X’JeHoo < A for any k G [m] with probability at least 1 — ^ > 
1 — ep _1 . By Lemma 2.7, when 


1. {Sfc} fcg [ m ] satisfy the RE condition on C* with constant /p, 

2. {(E fc ,©*)}*<=[„,] have generalized incoherence cgcI 1 ^)^ 


the second term is at most CG ,c Cs a ^ log p _ yy e p U ^ the pieces together to 

y/ C2 fJ'l & 11 

obtain _ 

3\/3cgcce s log p 
H-—-o-„ 


-/F 


< (X 


/2cq log p \ ^ 
n c 2 iV J ^ ^ 




n 


□ 


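Proof of Lemma 3.3. We express

$$(\Sigma^{-1}\hat\Sigma\Sigma^{-1})_{j,j} = (\Sigma^{-1}\hat\Sigma\Sigma^{-1})_{j,j} - \Sigma^{-1}_{j,j} + \Sigma^{-1}_{j,j} = \frac{1}{n}\sum_{i=1}^n (x_i^T\Sigma^{-1}_{\cdot,j})^2 - \Sigma^{-1}_{j,j} + \Sigma^{-1}_{j,j}.$$

Since the subgaussian norm of $z_i = \Sigma^{-1/2}x_i$ is $\sigma_x$, $x_i^T\Sigma^{-1}_{\cdot,j}$ is also subgaussian
with subgaussian norm bounded by

$$\|x_i^T\Sigma^{-1}_{\cdot,j}\|_{\psi_2} \le \|\Sigma^{-1/2}x_i\|_{\psi_2}\,\|\Sigma^{-1/2}_{\cdot,j}\|_2 \le \sigma_x(\Sigma^{-1}_{j,j})^{1/2}.$$

We recognize $\frac{1}{n}\sum_{i=1}^n (x_i^T\Sigma^{-1}_{\cdot,j})^2 - \Sigma^{-1}_{j,j}$ as a sum of i.i.d. subexponential
random variables with subexponential norm bounded by

$$\|(x_i^T\Sigma^{-1}_{\cdot,j})^2 - \Sigma^{-1}_{j,j}\|_{\psi_1} \le 2\|(x_i^T\Sigma^{-1}_{\cdot,j})^2\|_{\psi_1} \le 4\|x_i^T\Sigma^{-1}_{\cdot,j}\|_{\psi_2}^2 \le 4\sigma_x^2\Sigma^{-1}_{j,j}.$$

By Vershynin (2010), Proposition 5.16, we have

$$\Pr\Big(\frac{1}{n}\sum_{i=1}^n (x_i^T\Sigma^{-1}_{\cdot,j})^2 - \Sigma^{-1}_{j,j} > t\Big) \le 2\exp\Big(-c_1\min\Big\{\frac{nt^2}{16\sigma_x^4(\Sigma^{-1}_{j,j})^2},\ \frac{nt}{4\sigma_x^2\Sigma^{-1}_{j,j}}\Big\}\Big)$$

for some absolute constant $c_1 > 0$. For $t = \Sigma^{-1}_{j,j}$, the bound simplifies to

$$\Pr\Big(\frac{1}{n}\sum_{i=1}^n (x_i^T\Sigma^{-1}_{\cdot,j})^2 - \Sigma^{-1}_{j,j} > \Sigma^{-1}_{j,j}\Big) \le 2\exp\Big(-c_1\min\Big\{\frac{n}{16\sigma_x^4},\ \frac{n}{4\sigma_x^2}\Big\}\Big).$$

We take a union bound over $j \in [p]$ to obtain the stated result. □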


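Proof of Lemma 3.4. We follow a similar argument as the proof of
Lemma 3.3:

$$\hat\Sigma_{j,j} = \hat\Sigma_{j,j} - \Sigma_{j,j} + \Sigma_{j,j} = \frac{1}{n}\sum_{i=1}^n x_{i,j}^2 - \Sigma_{j,j} + \Sigma_{j,j}.$$

Since $z_i = \Sigma^{-1/2}x_i$ is subgaussian with subgaussian norm $\sigma_x$, $x_{i,j}$ is also
subgaussian with subgaussian norm bounded by

$$\|x_{i,j}\|_{\psi_2} \le \sigma_x(\Sigma_{j,j})^{1/2}.$$

We recognize $\hat\Sigma_{j,j} - \Sigma_{j,j} = \frac{1}{n}\sum_{i=1}^n (x_{i,j}^2 - \Sigma_{j,j})$ as a sum of i.i.d. subexponential
random variables with subexponential norm bounded by

$$\|x_{i,j}^2 - \Sigma_{j,j}\|_{\psi_1} \le 2\|x_{i,j}^2\|_{\psi_1} \le 4\|x_{i,j}\|_{\psi_2}^2 \le 4\sigma_x^2\Sigma_{j,j}.$$

By Vershynin (2010), Proposition 5.16, we have

$$\Pr(\hat\Sigma_{j,j} - \Sigma_{j,j} > t) \le 2\exp\Big(-c_1\min\Big\{\frac{nt^2}{16\sigma_x^4\Sigma_{j,j}^2},\ \frac{nt}{4\sigma_x^2\Sigma_{j,j}}\Big\}\Big)$$

for some absolute constant $c_1 > 0$. For $t = \Sigma_{j,j}$, the bound simplifies to

$$\Pr(\hat\Sigma_{j,j} - \Sigma_{j,j} > \Sigma_{j,j}) \le 2\exp\Big(-c_1\min\Big\{\frac{n}{16\sigma_x^4},\ \frac{n}{4\sigma_x^2}\Big\}\Big).$$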

We take a union bound over $j \in [p]$ to obtain the stated result. □

APPENDIX B: A SHARPER CONSISTENCY RESULT


It is possible to obtain a sharper consistency result by forgoing the $\ell_\infty$
norm convergence rate. By sharper, we mean the sample complexity of the
averaged estimator improves from $m \lesssim \frac{n}{s_0^2\log p}$ to $m \lesssim \frac{n}{s_0\log p}$.


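Theorem B.1. Under the conditions of Theorem 4.2, hard-thresholding
$\bar\beta$ at $t = |\bar\beta|_{(s_0')}$ for some $s_0' \sim s_0$, i.e. setting all but the largest $s_0'$ debiased
coefficients to zero, gives

1. $\|\bar\beta^{ht} - \beta^*\|_2 \lesssim_P \big(\frac{s_0\log p}{N}\big)^{1/2} + \frac{s_0\log p}{n}$,

2. $\|\bar\beta^{ht} - \beta^*\|_1 \lesssim_P \big(\frac{s_0^2\log p}{N}\big)^{1/2} + \frac{s_0^{3/2}\log p}{n}$.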
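
The thresholding rule in Theorem B.1 keeps only a fixed number of the largest coefficients in magnitude. A minimal Python sketch follows; the function name `keep_top_s` is hypothetical, and ties in magnitude are broken by index order, which is an assumption of this example rather than something specified in the text.

```python
def keep_top_s(beta, s):
    """Keep the s largest-magnitude entries of beta and zero out the rest."""
    # Indices sorted by decreasing magnitude; Python's sort is stable, so
    # ties keep their original index order.
    order = sorted(range(len(beta)), key=lambda j: abs(beta[j]), reverse=True)
    keep = set(order[:max(s, 0)])
    return [b if j in keep else 0.0 for j, b in enumerate(beta)]


beta_bar = [0.1, -3.0, 0.5, 2.0]
beta_ht = keep_top_s(beta_bar, s=2)
```

Unlike thresholding at a fixed level, this rule needs an estimate of the sparsity level rather than of the noise level, which is why the discussion after the theorem turns to estimating $s_0$.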


The sharper consistency result depends on a result by Javanmard and
Montanari (2013b), which we combine with Lemma 4.1 and restate for
completeness. Before stating the results, we define the $(\infty, l)$ norm of a point
$x \in \mathbb{R}^p$ as

$$\|x\|_{(\infty,l)} := \max_{A\subset[p],\,|A|\ge l} \frac{\|x_A\|_2}{\sqrt{|A|}}.$$

When $l = 1$, the $(\infty, l)$ norm of $x$ is its $\ell_\infty$ norm. When $l = p$, the $(\infty, l)$
norm is the $\ell_2$ norm (rescaled by $\frac{1}{\sqrt{p}}$). Thus the $(\infty, l)$ norm interpolates
between the $\ell_2$ and $\ell_\infty$ norms. Javanmard and Montanari (2013b), Theorem
2.3 shows that the bias of the debiased lasso is of order $\frac{\sqrt{s_0}\log p}{n}$.
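
The interpolation property is easy to verify numerically. The Python sketch below assumes the definition $\|x\|_{(\infty,l)} = \max_{|A|\ge l} \|x_A\|_2/\sqrt{|A|}$; the function name `inf_l_norm` is hypothetical. For each subset size $k \ge l$ the maximizing subset consists of the $k$ largest entries in magnitude, so it suffices to scan prefix sums of the sorted squares.

```python
import math


def inf_l_norm(x, l):
    """(infinity, l) norm: max over subsets A with |A| >= l of ||x_A||_2 / sqrt(|A|)."""
    squares = sorted((v * v for v in x), reverse=True)
    best, running = 0.0, 0.0
    for k, sq in enumerate(squares, start=1):
        running += sq  # sum of the k largest squared entries
        if k >= l:
            best = max(best, math.sqrt(running / k))
    return best


x = [3.0, -4.0, 0.0]
norm_l1 = inf_l_norm(x, 1)  # coincides with the l_infinity norm
norm_lp = inf_l_norm(x, 3)  # coincides with the l_2 norm divided by sqrt(p)
```

Because the average of the $k$ largest squares cannot increase with $k$, the maximum is attained at $k = l$, which also shows that the norm is non-increasing in $l$.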

Lemma B.2. Under the conditions of Theorem 4.2,

$$\|\hat\Delta_k\|_{(\infty,c's_0)} \le c\,\frac{\sqrt{s_0}\log p}{n} \quad \text{for any } k \in [m]$$

for any $c' > 0$, where $c$ is a constant that depends only on $c'$ and $\Sigma$.

Proof. The result is essentially Javanmard and Montanari (2013b), The¬ 
orem 2.3 with Cl = 0 given by (4.3). Lemma 4.1 shows that 


max. 


Je[p] 


IP). y 1 11, ^ n „ / logp t 2 
l U J,- 111 n ) ’ 


Since 


max jgM s j logp ^ o(l), 0 satisfies the conditions of Javanmard and 


Montanari (2013b), Theorem 2.3: 


IA II < c V / ^o 1 °g P 

l^fc||(oo,c'so) 


n 


for any k E [m\, 


The bound is uniform in $k \in [m]$ by a union bound for suitable parameters

By Lemma B.2, the estimator (4.2) is consistent in the $(\infty, s_0)$ norm. The
argument is similar to the proof of Theorem 4.2.

Theorem B.3. Under the conditions of Theorem 4.2,

$$\|\bar\beta - \beta^*\|_{(\infty,c's_0)} \lesssim_P \Big(\frac{\log p}{N}\Big)^{1/2} + \frac{\sqrt{s_0}\log p}{n}.$$

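Proof. We start by substituting the linear model into (4.2):

$$\bar\beta = \beta^* + \frac{1}{m}\sum_{k=1}^m \hat\Delta_k + \frac{1}{N}\hat\Theta X^T\epsilon.$$

Subtracting $\beta^*$ and taking norms, we obtain

$$\|\bar\beta - \beta^*\|_{(\infty,c's_0)} \le \frac{1}{m}\sum_{k=1}^m \|\hat\Delta_k\|_{(\infty,c's_0)} + \Big\|\frac{1}{N}\hat\Theta X^T\epsilon\Big\|_{(\infty,c's_0)}. \tag{B.1}$$

By Lemma B.2, the first (bias) term is of order $\frac{\sqrt{s_0}\log p}{n}$. We focus on showing
the second (variance) term is of order $\big(\frac{\log p}{N}\big)^{1/2}$. Since the $(\infty, l)$ norm is
non-increasing in $l$,

$$\Big\|\frac{1}{N}\hat\Theta X^T\epsilon\Big\|_{(\infty,c's_0)} \le \Big\|\frac{1}{N}\hat\Theta X^T\epsilon\Big\|_\infty.$$

By Vershynin (2010), Proposition 5.16 and Lemma 3.3, it is possible to show
that

$$\Big\|\frac{1}{N}\hat\Theta X^T\epsilon\Big\|_\infty \lesssim_P \Big(\frac{\log p}{N}\Big)^{1/2}.$$

Thus the second term in (B.1) is of order $\big(\frac{\log p}{N}\big)^{1/2}$. We put all the pieces
together to obtain the stated conclusion. □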

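We are ready to prove Theorem B.1. Since $\bar\beta^{ht} - \beta^*$ is $2s_0$-sparse,

$$\|\bar\beta^{ht} - \beta^*\|_2^2 \le 2s_0\,\|\bar\beta^{ht} - \beta^*\|_{(\infty,c's_0)}^2$$

or, equivalently,

$$\|\bar\beta^{ht} - \beta^*\|_2 \le \sqrt{2s_0}\,\|\bar\beta^{ht} - \beta^*\|_{(\infty,c's_0)}.$$

By the triangle inequality,

$$\begin{aligned}
\|\bar\beta^{ht} - \beta^*\|_{(\infty,c's_0)} &\le \|\bar\beta^{ht} - \bar\beta\|_{(\infty,c's_0)} + \|\bar\beta - \beta^*\|_{(\infty,c's_0)} \\
&\le 2\|\bar\beta - \beta^*\|_{(\infty,c's_0)},
\end{aligned}$$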




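where the second inequality is by the fact that thresholding at $t = |\bar\beta|_{(c's_0)}$
minimizes $\|\beta - \bar\beta\|_{(\infty,c's_0)}$ over $c's_0$-sparse points $\beta$. Thus

$$\|\bar\beta^{ht} - \beta^*\|_2 \lesssim_P \Big(\frac{s_0\log p}{N}\Big)^{1/2} + \frac{s_0\log p}{n}.$$

To complete the proof of Theorem B.1, we observe that the consistency of
$\bar\beta^{ht}$ in the $\ell_1$ norm follows by the fact that $\bar\beta^{ht} - \beta^*$ is $2s_0$-sparse.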

By Theorem B.1, when $m \lesssim \frac{n}{s_0\log p}$, the variance term is dominant and
the convergence rates given by the theorem simplify to the convergence rates
of the (centralized) lasso estimator:

1. $\|\bar\beta^{ht} - \beta^*\|_2 \lesssim_P \big(\frac{s_0\log p}{N}\big)^{1/2}$,

2. $\|\bar\beta^{ht} - \beta^*\|_1 \lesssim_P \big(\frac{s_0^2\log p}{N}\big)^{1/2}$.

Thus, by forgoing consistency in the $\ell_\infty$ norm, it is possible to reduce the
sample complexity of the averaged estimator to $m \lesssim \frac{n}{s_0\log p}$. When $m = 1$,
we recover the sample complexity of the centralized lasso estimator.

Theorem B.1 requires an estimate of $s_0$. To wrap up, we mention that it
is possible to obtain a good estimate of $s_0$ by the empirical sparsity of any of
the local lasso estimators. Let $\hat{\mathcal{E}} \subset [p]$ be the equicorrelation set of the lasso
estimator:

$$\hat{\mathcal{E}} := \Big\{j \in [p] : \frac{1}{n}|x_j^T(y - X\hat\beta)| = \lambda\Big\}.$$

The empirical sparsity $\hat s_0$ is the size of $\hat{\mathcal{E}}$.


Lemma B.4 (Sun (2015), Lemma 6.20). Under (A1)-(A3), when
n > max{4000so(T^ log( 6 °^ ep ), 4000<r^logp, sologp}, 
where sq := sq + 25920kso, we have 


so < 


/ 192a^ + 384A m 

' Amin (E) 


c(S) 


+ 


384V4 

ciA min (S)V S 


with probability at least 1 — 2p ( s °+b. 